fix(swe-bench-pro): make the default rk run work out of the box #26

Merged: kentwelcome merged 1 commit into main from fix/swe-bench-pro-default-run on Jun 25, 2026

@kentwelcome

Problem

A default swe-bench-pro run — rk run examples/specs/swe-bench-pro-spacedock-codex.yaml with the default --materialize bind — hit two blockers before the solver ever ran, each requiring a per-spec/per-invocation workaround to get past:

  1. Docker build fails: failed to read dockerfile: open Dockerfile: no such file or directory
  2. Agent setup fails: codex runtime adapter cannot honor unsupported harbor_agent_kwargs field 'max_turns'

Root causes & fixes

1. Symlinked Docker build context. In bind/link mode the materializer symlinked the whole task tree, including environment/Dockerfile. docker compose build uses the view's environment/ as its build context, and BuildKit cannot read a Dockerfile that symlinks outside the context — so every build-from-source benchmark broke under the default mode. Fix: always materialize the environment/ build context as real files, even in link mode (mirrors the existing task.toml carve-out). Bulk task files still symlink, preserving bind's no-eager-duplication benefit.

2. max_turns on codex. The example spec set max_turns: 400, but the codex runtime only accepts the default (200); any other value raises (intentional — claude honors max_turns, codex does not). Fix: set the example spec to 200 with an explanatory comment, and make the rejection message actionable (point users at the default + override_timeout_sec/max_timeout_sec for wall-clock budget).

3. Leakage hardening (from a codex adversarial review of this diff). Copying the build context follows symlinks (shutil.copy2), and the name-based deny filter only sees a link's own path — so a disguised symlink (environment/leak.patch -> ../gold_patch.diff) could embed a denied answer artifact's bytes under an allowed view path. The materializer now resolves any source symlink and re-applies the deny check + source-containment, dropping denied/out-of-tree targets in both copy and link modes.
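
The check described in item 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `materialize.py`: the `DENY_GLOBS` pattern list and the `is_denied` / `safe_source` helpers are hypothetical names, and only `gold_patch.diff` comes from the example above.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional

# Hypothetical deny list; the project's real patterns are not shown in this PR.
DENY_GLOBS = ("gold_patch.diff",)


def is_denied(rel_path: str) -> bool:
    """Name-based check against a path relative to the task source root."""
    name = Path(rel_path).name
    return any(fnmatch(rel_path, g) or fnmatch(name, g) for g in DENY_GLOBS)


def safe_source(src: Path, source_root: Path) -> Optional[Path]:
    """Return the real file to copy or link, or None if it must be dropped."""
    # First pass: the link's own path, which is all a name-based filter sees.
    if is_denied(src.relative_to(source_root).as_posix()):
        return None
    # Second pass: follow symlinks and re-check where the bytes really live.
    real = src.resolve()
    try:
        real_rel = real.relative_to(source_root.resolve()).as_posix()
    except ValueError:
        return None  # target escapes the source tree
    if is_denied(real_rel):
        return None  # disguised link to a denied artifact
    return real
```

With this shape, `environment/leak.patch -> ../gold_patch.diff` passes the first check (its own name is allowed) and is dropped by the second, because the resolved target matches a deny pattern.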

Verification

  • End-to-end: a default-settings N=1 smoke now builds, solves, and scores reward 1.0 on the ansible task (no --materialize copy, no spec edits).
  • Tests: three new unit tests (link-mode build context stays real; max_turns rejection is actionable; disguised symlink to a denied target is dropped, proven to fail pre-fix). 104 tests pass across the materialize, leakage, translate, runtime-adapter, and swe-bench-pro surfaces, including the integration explain test.
  • Did not weaken the intentional "refuse to silently drop" codex guard.
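
The property checked by the first new unit test (link-mode build context stays real) can be pictured with a simplified materializer. This is a sketch under assumptions: `materialize_link_view`, `REAL_FILE_DIRS`, and `REAL_FILES` are hypothetical names, and the deny and symlink-target filtering from item 3 is deliberately omitted to keep the carve-out visible.

```python
import os
import shutil
from pathlib import Path

# Paths that must be real files in the view even in link mode (assumed names).
REAL_FILE_DIRS = ("environment",)  # Docker build context
REAL_FILES = ("task.toml",)        # the existing carve-out mentioned above


def materialize_link_view(source_root: Path, view_root: Path) -> None:
    """Build a view of the task tree: symlinks by default, copies where required."""
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(source_root)
        dst = view_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        needs_real = rel.parts[0] in REAL_FILE_DIRS or rel.as_posix() in REAL_FILES
        if needs_real:
            # copy2 follows symlinks, so the view holds real bytes that
            # BuildKit can read from inside the build context.
            shutil.copy2(src, dst)
        else:
            # Bulk task files stay as links, avoiding eager duplication.
            os.symlink(src.resolve(), dst)
```

A test in the spirit of the one described would materialize a small tree and assert that `environment/Dockerfile` in the view is a regular file while a bulk file elsewhere is still a symlink.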

🤖 Generated with Claude Code

…orkarounds

A default swe-bench-pro run (`rk run examples/specs/swe-bench-pro-spacedock-codex.yaml`,
default `--materialize bind`) hit two blockers before the solver ever ran:

1. Docker build failed: `failed to read dockerfile: no such file or directory`.
   In bind/link mode the materializer symlinked the whole task tree, including
   `environment/Dockerfile`. `docker compose build` uses the view's `environment/`
   as its build context and BuildKit cannot read a Dockerfile that symlinks
   outside the context — so every build-from-source benchmark broke under the
   default mode. Fix: always materialize the `environment/` build context as real
   files, even in link mode (mirrors the existing task.toml carve-out); bulk task
   files still symlink, preserving bind's no-eager-duplication benefit.

2. Agent setup failed: `codex runtime adapter cannot honor ... 'max_turns'`.
   The example spec set `max_turns: 400`, but the codex runtime only accepts the
   default (200); any other value raises (intentional — claude honors max_turns,
   codex does not). Fix: set the example spec to 200 with a comment, and make the
   rejection message actionable (point users at the default + timeout budgeting).

Leakage hardening (from codex review): copying the build context follows symlinks
(shutil.copy2), and the name-based deny filter only sees a link's own path — so a
disguised symlink (`environment/leak.patch -> ../gold_patch.diff`) could embed a
denied answer artifact's bytes under an allowed view path. The materializer now
resolves any source symlink and re-applies the deny check + source containment,
dropping denied/out-of-tree targets in both copy and link modes.

Verified end-to-end: a default-settings N=1 smoke now builds, solves, and scores
(reward 1.0 on the ansible task). Unit tests cover all three (link-mode build
context stays real; max_turns rejection is actionable; disguised symlink to a
denied target is dropped, proven to fail pre-fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings, June 25, 2026 03:00

Copilot AI left a comment

Pull request overview

This PR fixes two “out of the box” blockers for running the SWE-bench-pro Codex example with the default rk run materialization mode, by making Harbor task view materialization Docker-build friendly and making Codex’s max_turns rejection message actionable.

Changes:

  • Materialize the environment/ Docker build context as real files even in view_mode="link", and harden leakage filtering by re-checking resolved symlink targets (deny-globs + source-tree containment).
  • Improve Codex runtime adapter errors for unsupported harbor_agent_kwargs.max_turns with an actionable hint (keep 200, use timeouts for wall-clock budget).
  • Add unit tests covering link-mode build-context materialization, Codex actionable error messaging, and disguised-symlink leakage prevention; update the SWE-bench-pro Codex example spec to use max_turns: 200 with guidance.
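
The second change can be illustrated with a guard of roughly this shape. It is a sketch, not the adapter's code: `check_codex_kwargs` and `UnsupportedAgentKwargError` are hypothetical names, and the hint wording after the quoted error prefix is an assumption. Only the default of 200 and the `override_timeout_sec` / `max_timeout_sec` pointers come from the PR description.

```python
from typing import Any, Mapping

CODEX_DEFAULT_MAX_TURNS = 200  # the only value the PR says codex accepts


class UnsupportedAgentKwargError(ValueError):
    """Hypothetical error type for kwargs the runtime cannot honor."""


def check_codex_kwargs(harbor_agent_kwargs: Mapping[str, Any]) -> None:
    """Refuse non-default max_turns instead of silently dropping it."""
    max_turns = harbor_agent_kwargs.get("max_turns", CODEX_DEFAULT_MAX_TURNS)
    if max_turns != CODEX_DEFAULT_MAX_TURNS:
        raise UnsupportedAgentKwargError(
            "codex runtime adapter cannot honor unsupported "
            f"harbor_agent_kwargs field 'max_turns' (got {max_turns}). "
            f"Keep the default ({CODEX_DEFAULT_MAX_TURNS}) and use "
            "override_timeout_sec / max_timeout_sec for a wall-clock budget."
        )
```

The guard still raises on any non-default value, so the "refuse to silently drop" behavior is preserved; only the message gains a next step for the user.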

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/razorback/harbor_tasks/materialize.py` | Forces environment/ to be copied as real files in link mode; prevents symlink-target smuggling of denied/out-of-tree content. |
| `src/razorback/agents/_runtime/codex.py` | Adds a targeted, actionable hint when rejecting non-default max_turns for Codex. |
| `tests/unit/test_harbor_task_view_materializer.py` | Adds regression tests for link-mode Docker build context behavior and symlink leakage hardening. |
| `tests/unit/test_runtime_adapters.py` | Adds a test ensuring the Codex max_turns rejection message contains actionable guidance. |
| `examples/specs/swe-bench-pro-spacedock-codex.yaml` | Updates the example to max_turns: 200 and documents using timeouts for wall-clock budget. |

@kentwelcome kentwelcome merged commit 2d8a1ae into main Jun 25, 2026
1 check passed
@kentwelcome kentwelcome deleted the fix/swe-bench-pro-default-run branch June 25, 2026 03:10